1,345 research outputs found

    A Simple Data-Adaptive Probabilistic Variant Calling Model

    Background: Several sources of noise obscure the identification of single nucleotide variation (SNV) in next-generation sequencing data. For instance, errors may be introduced during library construction and sequencing steps. In addition, the reference genome and the algorithms used to align the reads are further critical factors determining the efficacy of variant calling methods. It is crucial to account for these factors in individual sequencing experiments. Results: We introduce a simple data-adaptive model for variant calling. This model automatically adjusts to specific factors such as alignment errors. To achieve this, several characteristics are sampled from sites with low mismatch rates, and these are used to estimate empirical log-likelihoods. These likelihoods are then combined into a score that typically gives rise to a mixture distribution. From this distribution we determine a decision threshold to separate potentially variant sites from the noisy background. Conclusions: In simulations we show that our simple proposed model is competitive with frequently used, much more complex SNV calling algorithms in terms of sensitivity and specificity. It performs particularly well in cases with low allele frequencies. The application to next-generation sequencing data reveals stark differences between the score distributions, indicating a strong influence of data-specific sources of noise. The proposed model is specifically designed to adjust to these differences. Comment: 19 pages, 6 figures
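
The scoring idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the characteristic names (`quality`, `position`), their binning, the add-one smoothing, and the way a threshold would be applied are all assumptions made here for illustration. The sketch estimates an empirical log-likelihood table per characteristic from background (low-mismatch) sites and sums the tables into one score per site.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Iterable, Sequence


def empirical_loglik(samples: Iterable[Hashable],
                     alphabet: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Add-one smoothed log-probability of each value in `alphabet`,
    estimated from background samples (smoothing is an assumption)."""
    counts = Counter(samples)
    total = sum(counts[v] for v in alphabet) + len(alphabet)
    return {v: math.log((counts[v] + 1) / total) for v in alphabet}


def site_score(features: Sequence[Hashable],
               tables: Sequence[Dict[Hashable, float]]) -> float:
    """Sum of per-characteristic log-likelihoods under the background model."""
    return sum(table[value] for value, table in zip(features, tables))


# Hypothetical characteristics sampled from low-mismatch background sites.
quality = empirical_loglik(["low", "low", "low", "high"], ["low", "high"])
position = empirical_loglik(["end", "end", "mid"], ["end", "mid"])

noise_like = site_score(["low", "end"], [quality, position])
variant_like = site_score(["high", "mid"], [quality, position])
```

In this toy setup a site that looks like the sampled background receives a higher score than one that does not, so a cutoff on the score separates the two groups. How the cutoff is derived from the mixture distribution is not specified in the abstract and is left out here.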

    ORFer – retrieval of protein sequences and open reading frames from GenBank and storage into relational databases or text files

    BACKGROUND: Functional genomics involves parallel experimentation with large sets of proteins. This requires management of large sets of open reading frames as a prerequisite for the cloning and recombinant expression of these proteins. RESULTS: A Java program was developed for retrieval of protein and nucleic acid sequences and annotations from NCBI GenBank, using the XML sequence format. Annotations retrieved by ORFer include sequence name, organism and the completeness of the sequence. The program has a graphical user interface, although it can also be used in a non-interactive mode. For protein sequences, the program also extracts the open reading frame sequence, if available, and checks that it translates correctly. ORFer accepts user input in the form of single GenBank GI identifiers or accession numbers, or lists of them. It can be used to extract complete sets of open reading frames and protein sequences from any kind of GenBank sequence entry, including complete genomes or chromosomes. Sequences are either stored with their features in a relational database or exported as text files in FASTA or tab-delimited format. The ORFer program is freely available at http://www.proteinstrukturfabrik.de/orfer. CONCLUSION: The ORFer program allows for fast retrieval of DNA sequences, protein sequences and their open reading frames and sequence annotations from GenBank. Furthermore, storage of sequences and features in a relational database is supported. Such a database can supplement a laboratory information system (LIMS) with appropriate sequence information.
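
The translation check mentioned in this abstract can be sketched as follows. ORFer itself is a Java program and the abstract does not say how its check works, so this is only an illustration in Python: the function names are hypothetical, only the standard genetic code is used, and a single trailing stop codon is assumed to be acceptable.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(orf: str) -> str:
    """Translate a DNA open reading frame; unknown codons become 'X'."""
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of three")
    return "".join(CODON_TABLE.get(orf[i:i + 3], "X")
                   for i in range(0, len(orf), 3))


def orf_matches_protein(orf: str, protein: str) -> bool:
    """True if the ORF translates to the protein, ignoring one final stop."""
    translated = translate(orf)
    if translated.endswith("*"):
        translated = translated[:-1]
    return translated == protein.upper()
```

For example, `orf_matches_protein("ATGGCCTAA", "MA")` compares the translation of the three codons ATG, GCC and TAA against the two-residue protein.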

    Genome Informatics for High-Throughput Sequencing Data Analysis: Methods and Applications

    This thesis introduces three different algorithmic and statistical strategies for the analysis of high-throughput sequencing data. First, we introduce a heuristic method based on enhanced suffix arrays to map short sequences to larger reference genomes. The algorithm builds on the idea of an error-tolerant traversal of the suffix array for the reference genome in conjunction with the concept of matching statistics introduced by Chang and a bitvector-based alignment algorithm proposed by Myers. The algorithm supports paired-end and mate-pair alignments, and the implementation offers methods for primer detection as well as primer and poly-A trimming. In our own benchmarks as well as in independent benchmarks, this tool outcompetes other currently available tools with respect to sensitivity and specificity in simulated and real data sets for a large number of sequencing protocols. Second, we introduce a novel dynamic programming algorithm for the spliced alignment problem. The advantage of this algorithm is its capability to detect not only collinear splice events, i.e. local splice events on the same genomic strand, but also circular and other non-collinear splice events. This succinct and simple algorithm handles all these cases at the same time with high accuracy. While it is on par with other state-of-the-art methods for collinear splice events, it outcompetes other tools for many non-collinear splice events. The application of this method to publicly available sequencing data led to the identification of a novel isoform of the tumor suppressor gene p53. Since this gene is one of the best-studied genes in the human genome, this finding is quite remarkable and suggests that the application of our algorithm could help to identify a plethora of novel isoforms and genes. Third, we present a data-adaptive method to call single nucleotide variations (SNVs) from aligned high-throughput sequencing reads. We demonstrate that our method, based on empirical log-likelihoods, automatically adjusts to the quality of a sequencing experiment and thus renders a "decision" on when to call an SNV. In our simulations this method is on par with current state-of-the-art tools. Finally, we present biological results that have been obtained using the special features of the presented alignment algorithm.

    Direct measurement of molecular stiffness and damping in confined water layers

    We present direct and linear measurements of the normal stiffness and damping of a confined, few-molecule-thick water layer. The measurements were obtained by use of a small-amplitude (0.36 Å), off-resonance Atomic Force Microscopy (AFM) technique. We measured stiffness and damping oscillations revealing up to 7 layers separated by 2.56 ± 0.20 Å. Relaxation times could also be calculated and were found to indicate a significant slow-down of the dynamics of the system as the confining separation was reduced. We found that the dynamics of the system is determined not only by the interfacial pressure, but more significantly by solvation effects which depend on the exact separation of tip and surface. Thus 'solidification' seems not to be merely a result of pressure and confinement, but depends strongly on how commensurate the confining cavity is with the molecule size. We were able to model the results by starting from the simple assumption that the relaxation time depends linearly on the film stiffness. Comment: 7 pages, 6 figures, will be submitted to PR

    Fast local fragment chaining using sum-of-pair gap costs

    Background: Fast seed-based alignment heuristics such as BLAST and BLAT have become indispensable tools in comparative genomics for all studies aiming at the evolutionary relations of proteins, genes, and non-coding RNAs. This is true in particular for the large mammalian genomes. The sensitivity and specificity of these tools, however, crucially depend on parameters such as seed sizes or maximum expectation values. In settings that require high sensitivity, the number of short local match fragments easily becomes intractable. Fragment chaining is then a powerful means to quickly connect, score, and rank the fragments to improve the specificity. Results: Here we present a fast and flexible fragment chainer that for the first time also supports a sum-of-pair gap cost model. This model has proven to achieve a higher accuracy and sensitivity in its own field of application. Due to a highly time-efficient index structure, our method outperforms the only existing tool for fragment chaining under the linear gap cost model. It can easily be applied to the output generated by alignment tools such as segemehl or BLAST. As an example we consider homology-based searches for human and mouse snoRNAs, demonstrating that a highly sensitive BLAST search with subsequent chaining is an attractive option. The sum-of-pair gap costs provide a substantial advantage in this context. Conclusions: Chaining of short match fragments helps to quickly and accurately identify regions of homology that may not be found using local alignment heuristics alone. By providing both the linear and the sum-of-pair gap cost model, a wider range of applications can be covered. The software clasp is available at http://www.bioinf.uni-leipzig.de/Software/clasp/.
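
Local fragment chaining under the two gap cost models can be sketched with a plain quadratic dynamic program. This is not how clasp works internally (the abstract credits its speed to an index structure that is not reproduced here), and the exact cost definitions below are assumptions: the linear cost charges `lam` per skipped position in either sequence, and the sum-of-pair cost charges `eps` per position that can be paired across the gap plus `lam` per remaining unpaired position.

```python
from typing import Callable, List, NamedTuple, Sequence, Tuple


class Fragment(NamedTuple):
    q_start: int  # inclusive start in the query
    q_end: int    # inclusive end in the query
    s_start: int  # inclusive start in the subject
    s_end: int    # inclusive end in the subject
    score: float


def linear_gap(a: Fragment, b: Fragment, lam: float = 1.0) -> float:
    """Assumed linear model: every skipped position costs `lam`."""
    dq = b.q_start - a.q_end - 1
    ds = b.s_start - a.s_end - 1
    return lam * (dq + ds)


def sum_of_pair_gap(a: Fragment, b: Fragment,
                    lam: float = 1.0, eps: float = 0.5) -> float:
    """Assumed sum-of-pair model: paired positions cost `eps`, the rest `lam`."""
    dq = b.q_start - a.q_end - 1
    ds = b.s_start - a.s_end - 1
    return eps * min(dq, ds) + lam * abs(dq - ds)


def best_chain(fragments: Sequence[Fragment],
               gap: Callable[[Fragment, Fragment], float] = linear_gap
               ) -> Tuple[float, List[Fragment]]:
    """Highest-scoring chain of non-overlapping, collinear fragments (O(n^2))."""
    frags = sorted(fragments, key=lambda f: (f.q_start, f.s_start))
    if not frags:
        return 0.0, []
    best = [f.score for f in frags]   # best chain score ending at each fragment
    prev = [-1] * len(frags)          # back-pointers for chain recovery
    for j, b in enumerate(frags):
        for i in range(j):
            a = frags[i]
            if a.q_end < b.q_start and a.s_end < b.s_start:
                candidate = best[i] + b.score - gap(a, b)
                if candidate > best[j]:
                    best[j], prev[j] = candidate, i
    k = max(range(len(frags)), key=best.__getitem__)
    total, chain = best[k], []
    while k != -1:
        chain.append(frags[k])
        k = prev[k]
    return total, chain[::-1]


# Hypothetical fragments: A and B are nearly adjacent, C lies far away.
A = Fragment(0, 9, 0, 9, 10.0)
B = Fragment(12, 21, 12, 21, 10.0)
C = Fragment(15, 24, 100, 109, 10.0)
linear_total, linear_chain = best_chain([C, B, A])
sop_total, sop_chain = best_chain([C, B, A], gap=sum_of_pair_gap)
```

Passing the gap function as a parameter keeps the chaining loop identical for both models, which makes it easy to compare how each cost treats the same set of fragments.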

    Dichotomous Impact of Myc on rRNA Gene Activation and Silencing in B Cell Lymphomagenesis

    A major transcriptional output of cells is ribosomal RNA (rRNA), synthesized by RNA polymerase I (Pol I) from multicopy rRNA genes (rDNA). Constitutive silencing of an rDNA fraction by promoter CpG methylation contributes to the stabilization of these otherwise highly active loci. In cancers driven by the oncoprotein Myc, excessive Myc directly stimulates rDNA transcription. However, it is not clear when during carcinogenesis this mechanism emerges, and how Myc-driven rDNA activation affects epigenetic silencing. Here, we have used the Eµ-Myc mouse model to investigate rDNA transcription and epigenetic regulation in Myc-driven B cell lymphomagenesis. We have developed a refined cytometric strategy to isolate B cells from the tumor initiation, promotion, and progression phases, and found a substantial increase of both Myc and rRNA gene expression only in established lymphoma. Surprisingly, promoter CpG methylation and the machinery for rDNA silencing were also strongly up-regulated in the tumor progression state. The data indicate a dichotomous role of oncogenic Myc in rDNA regulation, boosting transcription as well as reinforcing repression of silent repeats, which may provide a novel angle on perturbing Myc function in cancer cells.

    Cancer Risks near Nuclear Facilities: The Importance of Research Design and Explicit Study Hypotheses

    Background: In April 2010, the U.S. Nuclear Regulatory Commission asked the National Academy of Sciences to update a 1990 study of cancer risks near nuclear facilities. Prior research on this topic has suffered from problems in hypothesis formulation and research design. Objectives: We review epidemiologic principles used in studies of generic exposure–response associations and in studies of specific sources of exposure. We then describe logical problems with assumptions, formation of testable hypotheses, and interpretation of evidence in previous research on cancer risks near nuclear facilities. Discussion: Advancement of knowledge about cancer risks near nuclear facilities depends on testing specific hypotheses grounded in physical and biological mechanisms of exposure and susceptibility while considering sample size and ability to adequately quantify exposure, ascertain cancer cases, and evaluate plausible confounders. Conclusions: Next steps in advancing knowledge about cancer risks near nuclear facilities require studies of childhood cancer incidence, focus on in utero and early childhood exposures, use of specific geographic information, and consideration of pathways for transport and uptake of radionuclides. Studies of cancer mortality among adults, cancers with long latencies, large geographic zones, and populations that reside at large distances from nuclear facilities are better suited for public relations than for scientific purposes.

    Altered Glycosylation in the Aging Heart

    Cardiovascular disease is one of the leading causes of death in developed countries. Because the incidence increases exponentially in the aging population, aging is a major risk factor for cardiovascular disease. Cardiac hypertrophy, fibrosis and inflammation are typical hallmarks of the aged heart. The molecular mechanisms, however, are poorly understood. Because glycosylation is one of the most common post-translational protein modifications and can affect biological properties and functions of proteins, we here provide the first analysis of the cardiac glycoproteome of mice at different ages. Western blot as well as MALDI-TOF-based glycome analysis suggest that high-mannose N-glycans increase with age. In agreement, we found an age-related regulation of GMPPB, the enzyme that facilitates the supply of the sugar donor GDP-mannose. Glycoprotein pull-downs from heart lysates of young, middle-aged and old mice in combination with quantitative mass spectrometry support widespread alterations of the cardiac glycoproteome. Major hits are glycoproteins related to the extracellular matrix and Ca2+-binding proteins of the endoplasmic reticulum. We propose that changes in the heart glycoproteome likely contribute to the age-related functional decline of the cardiovascular system.

    A multi-split mapping algorithm for circular RNA, splicing, trans-splicing and fusion detection

    Numerous high-throughput sequencing studies have focused on detecting conventionally spliced mRNAs in RNA-seq data. However, non-standard RNAs arising through gene fusion, circularization or trans-splicing are often neglected. We introduce a novel, unbiased algorithm to detect splice junctions from single-end cDNA sequences. In contrast to other methods, our approach accommodates multi-junction structures. Our method compares favorably with competing tools for conventionally spliced mRNAs and, with a gain of up to 40% in recall, systematically outperforms them on reads with multiple splits, trans-splicing and circular products.